Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python 3 by Artem Kovera



The disadvantages of k-means and methods to overcome them

Although the k-means algorithm has some important advantages over other clustering methods, it also has a number of drawbacks.

The first disadvantage of this algorithm is that, depending on the initial positions of the centroids, we can get dramatically different results. So the k-means algorithm is not deterministic, and there is no guarantee that it will converge to the global optimum. We can partially get around these problems by running the algorithm multiple times and picking the output with the smallest variance. Another solution is to use k-means++ initialization. It is best to use both of these methods simultaneously, as we just did with Scikit-learn's k-means.
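In Scikit-learn, both remedies can be combined in a single call. Here is a minimal sketch, assuming an illustrative data array X (the parameter values are only examples):

import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: 100 random two-dimensional points
X = np.random.RandomState(0).randn(100, 2)

# init='k-means++' spreads the initial centroids far apart;
# n_init=10 reruns k-means 10 times from different starts and keeps
# the run with the lowest inertia (within-cluster sum of squares)
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = model.fit_predict(X)
print(model.inertia_)  # sum of squared distances to the closest centroids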

The second and probably most important disadvantage of the k-means algorithm is that we have to specify the number of clusters in advance.

Having some a priori knowledge about the problem can help choose k – the number of clusters. Sometimes we indeed have such knowledge. For example, in clustering astrophysical images, it’s known beforehand that there are two types of brightest objects in the cosmos: galaxies and quasars, so we can determine the number of clusters as two.

When we don’t have enough prior knowledge about the problem, we need to search for a good k. We can run the algorithm for different values of k and compare the variances of the resulting clusters. But it turns out that the more clusters we use, the lower the variance, so by this criterion the best number of clusters would simply equal the number of data points, which of course makes no sense.

Instead of choosing the result with the lowest variance, we can use the Elbow method. In this method, we run the k-means algorithm with different values of k. We should choose the number of clusters at which adding another cluster no longer gives a substantial decrease in the within-cluster sum of squared errors or, in other words, a substantial increase in the ratio of the between-group variance to the total variance.

In the following example of the Elbow method, we will use the k-means algorithm from the Scikit-learn library and the cdist function for distance computation from the SciPy library:
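A minimal sketch of how such an elbow-method computation might look, assuming synthetic two-dimensional data (the blob centers and the range of k values here are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# Illustrative data: three Gaussian blobs in the plane
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + center
               for center in [(0, 0), (5, 5), (0, 5)]])

distortions = []
ks = range(1, 10)
for k in ks:
    model = KMeans(n_clusters=k, init='k-means++', n_init=10,
                   random_state=0).fit(X)
    # Average distance from each point to its nearest centroid
    distortions.append(
        cdist(X, model.cluster_centers_, 'euclidean').min(axis=1).mean())

plt.plot(ks, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('The Elbow Method')
plt.show()

The "elbow" is the value of k beyond which the curve flattens out; for data like these three well-separated blobs, the bend typically appears at k = 3.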





